Automatic acquisition of lexical knowledge from sparse and noisy data

Internal identifier: 002252 (Main/Exploration); previous: 002251; next: 002253

Authors: René Schneider [Germany, Colombia]

Source:

Lecture Notes in Computer Science [ISSN 0302-9743]; 1998.

RBID : ISTEX:BDBD4F0D8A169A0292B40A398C73FA27F8857114

French descriptors

Pascal (Inist)
- Acquisition connaissance, Algorithme apprentissage, Compréhension langage, Linguistique mathématique, Reconnaissance optique caractère, Traitement langage.

English descriptors

KwdEn:
- Computational linguistics, Knowledge acquisition, Language comprehension, Language processing, Learning algorithm, Optical character recognition.

Abstract

Abstract: Optical character recognition (OCR) still introduces considerable information loss and noise into texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method identifies regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of “noisy” texts.
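
The abstract names frequency lists and Levenshtein matrices but does not spell out how they are combined. As a rough illustration only, and not the paper's actual algorithm, the Python sketch below counts token frequencies, treats frequent tokens as a core lexicon, and maps each rare token to the closest core entry when their edit distance is small. The function names, the `min_count` and `max_distance` thresholds, and the toy token list are all assumptions made for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row over the DP matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def build_core_lexicon(tokens, min_count=2, max_distance=1):
    """Hypothetical helper: frequent tokens form the core lexicon;
    rare tokens close to a core entry are recorded as noisy variants."""
    freq = Counter(tokens)
    core = [w for w, c in freq.most_common() if c >= min_count]
    variants = {}
    for word, count in freq.items():
        if count >= min_count:
            continue
        # Closest core entry by edit distance (None if the core is empty).
        best = min(core, key=lambda w: levenshtein(word, w), default=None)
        if best is not None and levenshtein(word, best) <= max_distance:
            variants[word] = best
    return core, variants


# Toy input imitating OCR confusions such as "o" -> "0" and "l" -> "1".
tokens = "invoice invoice invoice inv0ice total total tota1 amount".split()
core, variants = build_core_lexicon(tokens)
print(core)
print(variants)
```

In this sketch a rare token with no close core entry (here `amount`) is left unassigned rather than forced onto a neighbour; a real system would need thresholds tuned to the corpus.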

Url:

https://api.istex.fr/document/BDBD4F0D8A169A0292B40A398C73FA27F8857114/fulltext/pdf

DOI: 10.1007/BFb0026670

Affiliations:

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Automatic acquisition of lexical knowledge from sparse and noisy data</title>
<author><name sortKey="Schneider, Rene" sort="Schneider, Rene" uniqKey="Schneider R" first="René" last="Schneider">René Schneider</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:BDBD4F0D8A169A0292B40A398C73FA27F8857114</idno>
<date when="1998" year="1998">1998</date>
<idno type="doi">10.1007/BFb0026670</idno>
<idno type="url">https://api.istex.fr/document/BDBD4F0D8A169A0292B40A398C73FA27F8857114/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001736</idno>
<idno type="wicri:Area/Istex/Curation">001640</idno>
<idno type="wicri:Area/Istex/Checkpoint">001756</idno>
<idno type="wicri:doubleKey">0302-9743:1998:Schneider R:automatic:acquisition:of</idno>
<idno type="wicri:Area/Main/Merge">002369</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:98-0263949</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000888</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B09</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000852</idno>
<idno type="wicri:doubleKey">0302-9743:1998:Schneider R:automatic:acquisition:of</idno>
<idno type="wicri:Area/Main/Merge">002446</idno>
<idno type="wicri:Area/Main/Curation">002252</idno>
<idno type="wicri:Area/Main/Exploration">002252</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Automatic acquisition of lexical knowledge from sparse and noisy data</title>
<author><name sortKey="Schneider, Rene" sort="Schneider, Rene" uniqKey="Schneider R" first="René" last="Schneider">René Schneider</name>
<affiliation wicri:level="3"><country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Department of Speech and Language Understanding, Daimler-Benz AG, Institute of Information Technology, Ulm</wicri:regionArea>
<placeName><region type="land" nuts="1">Bade-Wurtemberg</region>
<region type="district" nuts="2">District de Tübingen</region>
<settlement type="city">Ulm</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Colombie</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<title level="s" type="sub">Lecture Notes in Artificial Intelligence</title>
<imprint><date>1998</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">BDBD4F0D8A169A0292B40A398C73FA27F8857114</idno>
<idno type="DOI">10.1007/BFb0026670</idno>
<idno type="ChapterID">6</idno>
<idno type="ChapterID">Chap6</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Computational linguistics</term>
<term>Knowledge acquisition</term>
<term>Language comprehension</term>
<term>Language processing</term>
<term>Learning algorithm</term>
<term>Optical character recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Acquisition connaissance</term>
<term>Algorithme apprentissage</term>
<term>Compréhension langage</term>
<term>Linguistique mathématique</term>
<term>Reconnaissance optique caractère</term>
<term>Traitement langage</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: Optical character recognition (OCR) still introduces considerable information loss and noise into texts, so that many documents are unsuitable for information extraction systems. This paper introduces a statistical method for bootstrapping a lexicon from a very small number of “noisy,” domain-specific texts. The method identifies regularities in grammatical forms as well as recurring ungrammatical forms in the input text. Through a combination of frequency lists and Levenshtein matrices, a language-independent, robust core lexicon is constructed that also supports the analysis of “noisy” texts.</div>
</front>
</TEI>
<affiliations><list><country><li>Allemagne</li>
<li>Colombie</li>
</country>
<region><li>Bade-Wurtemberg</li>
<li>District de Tübingen</li>
</region>
<settlement><li>Ulm</li>
</settlement>
</list>
<tree><country name="Allemagne"><region name="Bade-Wurtemberg"><name sortKey="Schneider, Rene" sort="Schneider, Rene" uniqKey="Schneider R" first="René" last="Schneider">René Schneider</name>
</region>
</country>
<country name="Colombie"><noRegion><name sortKey="Schneider, Rene" sort="Schneider, Rene" uniqKey="Schneider R" first="René" last="Schneider">René Schneider</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002252 | SxmlIndent | more

Or, assuming EXPLOR_AREA is already set to the area root ($WICRI_ROOT/Ticri/CIDE/explor/OcrV1):

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002252 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:BDBD4F0D8A169A0292B40A398C73FA27F8857114
   |texte=   Automatic acquisition of lexical knowledge from sparse and noisy data
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.